Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improves the deserialization speed by reducing object initialization overhead #284

Merged
merged 6 commits into from
Sep 19, 2023

Conversation

cheqianh
Copy link
Contributor

@cheqianh cheqianh commented Aug 9, 2023

Description:

Publish the PR for feedback but will keep it in draft mode until we are confident to merge this change. More changes and tests come soon.

This PR improved the library's performance by 40-50% through:
(1) migrating to new ionPyObjects without inheritance (+~20%),
(2) switching the way of object initialization from from_value() to constructor directly (+~10%), and
(3) optimizing Ion struct's underlying storage implementation (+~20%)

Unit tests

This PR also fixed more than 2k unit test failures due to the changes of IonPyObjects. Most of failures are because of (1) ion_nature inheritance, (2) from_value invocation, and (3) methods signature differences.

Detail:

In detail, create an IonPyObject requires a few steps which causes a lot of overhead. The overheads mainly come from (1) Ion_Nature inheritance, (2) from_value() method invocation overhead and (3) methods refactor overhead. In addition, Ion Struct, the Ion multiMap, produces a lot of overhead due to use of high level abstract classes such as OrderedDict, mutableMapping and MutableSequence. In the Ion Python C extension, we always create an empty Ion Struct first, and recursively call its "add_item" API to add new values. Every time inserting a value into the Struct, the code will check if this is the first value occurs for this key. If it's the first value for the key, it will create a new MultiSequence for the key, otherwise it will just insert the value into the list. Creating new mutableValues is extremely expensive when deserializing large Ion files.

So this PR first removed Ion_Nature inheritance and the reusable help functions for creating IonPyObject classes.

Then, all new IonPyObjects has its own constructor. The PR next switched the way of IonPyObjects initialization from from_value() to their constructors directly to avoid unnecessary method invocation overhead.

Last, the PR optimized the multiMap implementation by using native list instead of MutableSequence. In addition, since the Ion Struct is an unordered collection, we uses a Python dict directly instead of previous OrderedDict.

Benchmarking

The performance gain depends on various factors such as data model and size. Below are three sample data I used for benchmarking:

File_name Description Size (bytes) Before (s) After(s) Improvement
committed.10n a small Ion struct 148 0.19 0.13 31.6%
complex.10n an Ion struct with decimal values 5208 14.1 7.8 46.9%
legacy.10n complex nested Ion containers 22.3e^6 245 127 48%

Next Step

  1. There are more experiment and tests required, so we don't break any existing use case
  2. All above are focusing on the read(deserialization) process, try write process as well.
  3. more IonPyObjects migration for rest types such as timestamp and bytes.
  4. Using the benchmark-cli to generate an official benchmark result.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@JustinDaleGray JustinDaleGray self-requested a review August 10, 2023 00:48
@JustinDaleGray
Copy link

High Level Overview

  • Consider the use of __slots__ versus the use of __dicts__ or __weakref__. We can further reduce memory allocations by not creating a dictionary for attributes where possible. If a dictionary is explicitly required it can be declared within the slot.
  • Avoid unnecessary redundancy. Consider wrapping the type classes in module in a base class and moving similar methods. This will mean the use of polymorphism and using some vtable lookups. It might be useful to see what the penalty for DRY might be in this context.
  • Consider alternative data structures. For example, __setitem__(self, key, value) wraps the value in a list. If only the value is needed, then consider assigning it directly instead.
  • Avoid unnecessary conditional checks. Checking a conditional to set a value to None may be suboptimal.
  • I'm not keen on the foo_new() naming convention. Consider alternative naming conventions that adhere to the PEP 8 standard on class naming.

@rmarrowstone
Copy link
Contributor

At a high-level I think we need to commit to dropping some of the existing factory methods. I know it's arguably part of our API, so we'd need to rev at least minor, but probably major version, but I think it's worth it.

@rmarrowstone
Copy link
Contributor

Thanks for doing the work to get to this place! I'm super excited to see the benefits of this! Let's not be afraid to be bold in breaking API compatibility to do so.

from amazon.ion.symbols import SymbolToken


class IonPyNull_new(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may discover a reasonable "why" later, but barring that, ditch the _new suffix please.

Copy link
Contributor Author

@cheqianh cheqianh Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I will keep this comment unsolved until renaming them.

I named them object, object_new and object_original so that we can separate them in the existing tests for easier debug and performance comparisons.

EDIT: Once we approve other comments within this PR, I'll rename these classes to their appropriate names

Copy link
Contributor

@rmarrowstone rmarrowstone Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please confirm this is still the plan. [refers to resolved conversation about removing _new classes before actually merging]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'll rename it once we address other comments. I'm using these class names for easier before/after performance comparison...

Copy link
Contributor Author

@cheqianh cheqianh Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed the updated question. Would you also refer to removing the old IonPyObject? yeah I think we can completely switch to the _new IonPyObject and rename them to the official IonPyObject. This won't cause breaking changes. (I'm not sure if the pure python implementation still need old IonPyObjects. If they are, I'll leave a comment for further discussion otherwise I'll safely deprecate these old IonPyObjects, maybe a separate sub-PR targeting this one for all the clean up work)

I left draft comments and tried to send all of them at once but seems the comment doesn't cache draft once I click resolve conversation.

Copy link
Contributor Author

@cheqianh cheqianh Sep 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit b6fe2df renames IonPyObject_new to IonPyObject, we can safely remove IonPyObject_original now.

amazon/ion/ion_py_objects.py Outdated Show resolved Hide resolved
amazon/ion/ion_py_objects.py Outdated Show resolved Hide resolved
amazon/ion/ion_py_objects.py Outdated Show resolved Hide resolved
amazon/ion/ion_py_objects.py Outdated Show resolved Hide resolved
amazon/ion/ion_py_objects.py Outdated Show resolved Hide resolved
amazon/ion/ioncmodule.c Show resolved Hide resolved
amazon/ion/ioncmodule.c Outdated Show resolved Hide resolved
amazon/ion/equivalence.py Outdated Show resolved Hide resolved
tests/test_core_multimap.py Outdated Show resolved Hide resolved
@cheqianh
Copy link
Contributor Author

cheqianh commented Aug 10, 2023

Okay for this PR,

  1. the next commit will address above comments.
  2. As for the scope and goal of this PR, we are going to make break changes and completely deprecate old methods such as from_value. We will push the minor changes first. We will deprecate old methods and make breaking changes in the next PR.

  - Refactors some class attributes and renames tests and methods
  - Removes _copy()
  - Refactors ionPyDict's constrcutor signature
  - Changes ion_type to be a class attribute
tests/test_core_multimap.py Outdated Show resolved Hide resolved
@@ -26,7 +26,7 @@
from amazon.ion.symbols import SymbolToken


def isinstance_of_ion_nature(obj):
def isinstance_of_ion_nature_or_new_ion_py_objects(obj):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method will be deprecated with other old methods later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right... But what will replace it? I ask because I think that would be either just trying to get the iontype and annotations or testing if the two objects hasattr them. So why not do that now and avoid the series of ors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, refactored it to check if both ion_type and ion_annotations attributes exist.

def __init__(self, ion_type, value, annotations=()):
def __init__(self, ion_type=IonType.LIST, value=None, annotations=()):
if value is None:
value = []
super().__init__(value)
self.ion_annotations = annotations
# it's possible to be Sexp
self.ion_type = ion_type
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the constructor still can control ion_type as it's possible to be Sexp.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we call this IonPySequence or something, then? Or does this just need to conform to the existing naming?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, It won't break any current use case since we can call it any name we want (E.g., we are calling it an unofficial name with the _new suffix). Since the original list and Sexp are called IonPyList, this might affect some write use case (Not sure, maybe for the read use case, people just need to use IonPyList as a native list so they don't need to worry about its class name?).

Or maybe we can redeclare it to ionPyList in simple_types.py first, then fully deprecate it when we make breaking changes in the future.

ionPyList = ionPySequence

@cheqianh cheqianh marked this pull request as ready for review September 5, 2023 20:58
@@ -26,7 +26,7 @@
from amazon.ion.symbols import SymbolToken


def isinstance_of_ion_nature(obj):
def isinstance_of_ion_nature_or_new_ion_py_objects(obj):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right... But what will replace it? I ask because I think that would be either just trying to get the iontype and annotations or testing if the two objects hasattr them. So why not do that now and avoid the series of ors?

from amazon.ion.symbols import SymbolToken


class IonPyNull_new(object):
Copy link
Contributor

@rmarrowstone rmarrowstone Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please confirm this is still the plan. [refers to resolved conversation about removing _new classes before actually merging]

ion_type = IonType.NULL

def __init__(self, ion_type=IonType.NULL, value=None, annotations=()):
self.ion_type = ion_type # TODO initialized to NULL type first, what's the real type?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the action here? I would say it's OK as it is... or suggest that for IonPyNull we could avoid the class attribute. But I don't understand if this points to a deeper issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

if isinstance(self, IonPyNull_new) or self.ion_type.is_container:
value = None

if in_struct:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like we could avoid the logic and simplify the parameters by passing only passing field_name if it were in a struct. realize that may be a breaking change, although I'd like to understand what behavior is expected when passing a field_name with in_struct=False

Copy link
Contributor Author

@cheqianh cheqianh Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't spend too much effort on the to_event API since it's mainly used by the pure python implementation, but it seems that we do not have to maintain in_struct. It requires more investigation to see if this will cause breaking changes (at least some low-level internal methods need this parameter but the top-level dump() doesn't).


def to_event(self, event_type, field_name=None, in_struct=False, depth=None):
value = self
if isinstance(self, IonPyNull_new) or self.ion_type.is_container:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when would either of these be true?

Copy link
Contributor Author

@cheqianh cheqianh Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_event was same as the original factory method, which was used by both container and non-container Ion types so it had this condition check. I refactored them.

self.__store = {}
if value is not None:
for key, value in iter(value.items()):
self.__store[key] = [value]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this overwrite when value.items() has multiple mappings for a field?

Copy link
Contributor Author

@cheqianh cheqianh Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter value of the original multimap constructor is limited to Python's native dictionary. I added logic to support multi-value Map.

return self.__store[key][len(self.__store[key]) - 1] # Return only one in order not to break clients

def __delitem__(self, key):
del self.__store[key]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the contract for multiply-mapped fields?

Copy link
Contributor Author

@cheqianh cheqianh Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It deletes everything map to the given key currently.

Comment on lines 535 to 539
def items(self):
output = []
for k, v in self.iteritems():
output.append((k, v))
return output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this exist?

Copy link
Contributor Author

@cheqianh cheqianh Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you refer to items() vs.iteritems()?

Python 3 has deprecated iteritems for native dict. I'll see if we can delete it too.

Copy link
Contributor Author

@cheqianh cheqianh Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for items():

The python 3 dict.items() returns a set-like object providing a view on the dictionary's items.

Our IonPydict.item() still keeps the python 2 behavior that returns a list (and as a result, have to maintain a list).

This Python 2/3 and dict/IonPyDict behavior difference need to be fixed or documented. I'll create a Github issue and see if we still need this method and why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created GH issue - #288

annotations=self.ion_annotations, depth=depth)


class IonPyDict_new(MutableMapping):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is existing pydoc about the contract for setting, deleting and accessing fields with multiple mappings please ensure it is moved. If not, now is a great time to add it! :)

Copy link
Contributor Author

@cheqianh cheqianh Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I'm working on adding a method documents according to the current behavior.

amazon/ion/ion_py_objects.py Outdated Show resolved Hide resolved
@cheqianh
Copy link
Contributor Author

cheqianh commented Sep 7, 2023

A new commit is available below.

An overall comment: This PR tried to incorporate the performance improvement into the code with the least changes in case of any breaking change. Some methods are copied from the original IonPyObject but it's time to refactor/deprecate some of them! Also Added some documents.

Clean up work:

  • Remove old IonPyObjects and rename _new objects.
  • Deprecate from_value in the C extension.
  • After task 2 above, refactor constructor signature to drop unnecessary parameter.

Copy link
Contributor

@tgregg tgregg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job on these performance improvements! Just minor comments from me.

Comment on lines 1583 to 1591
_ionpyint_fromvalue = _ionpyint_cls;
_ionpynull_fromvalue = _ionpynull_cls;
_ionpyfloat_fromvalue = _ionpyfloat_cls;
_ionpydecimal_fromvalue = _ionpydecimal_cls;
_ionpytext_fromvalue = _ionpytext_cls;
_ionpylist_fromvalue = _ionpylist_cls;
_ionpydict_fromvalue = _ionpydict_cls;
_ionpybool_fromvalue = _ionpybool_cls;
_ionpysymbol_fromvalue = _ionpysymbol_cls;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove the lines like

_ionpynull_fromvalue        = PyObject_GetAttrString(_ionpynull_cls, "from_value");

above, which these seem to overwrite?

Copy link
Contributor Author

@cheqianh cheqianh Sep 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I added a new action item 2 to the next step list above.

Comment on lines 380 to 381
# generate_containers_binary(_SIMPLE_CONTAINER_MAP),
# generate_annotated_values_binary(SIMPLE_SCALARS_MAP_BINARY, _SIMPLE_CONTAINER_MAP),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these still be commented out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

__name__ = 'IonPyNull_new'
__qualname__ = 'IonPyNull_new'

def __init__(self, ion_type=IonType.NULL, value=None, annotations=()):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we drop the value parameter? Or is it needed for consistency with the other __init__ implementations?

Copy link
Contributor Author

@cheqianh cheqianh Sep 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to maintain the constructors' signatures identical to from_value's, so that we can perform an in-place replacement. This approach can also help us quantify the overhead caused by from_value. For example, switching from from_value to their constructors introduces an 8-12% improvement in performance.

value is not necessarily required after the PR is merged. I added a clean-up work 3 above to drop this parameter.

def __init__(self, ion_type, value, annotations=()):
def __init__(self, ion_type=IonType.LIST, value=None, annotations=()):
if value is None:
value = []
super().__init__(value)
self.ion_annotations = annotations
# it's possible to be Sexp
self.ion_type = ion_type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we call this IonPySequence or something, then? Or does this just need to conform to the existing naming?

* Deprecated `ion_nature`
* Migrated objects construction from `from_value()` to their constructors in C extension
* Added IonPytimestamp and IonPyBytes
* Replaced all the original IonPyObject and factory methods with the new IonPyObjects.
* Worked on the IonPyTimestamp signature to make it compatible with the amazon.core.Timestamp
* Updated the unit tests to accommodate the changes.
@cheqianh
Copy link
Contributor Author

The clean up commit did below:

  • Deprecated ion_nature
  • Migrated objects construction from from_value() to their constructors in C extension
  • Added IonPytimestamp and IonPyBytes
  • Replaced all the original IonPyObject and factory methods with the new IonPyObjects.
  • Worked on the IonPyTimestamp signature to make it compatible with the amazon.core.Timestamp
  • Updated the unit tests to accommodate the changes.

@cheqianh cheqianh merged commit 7a201c0 into amazon-ion:master Sep 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants